# Large-scale Corpus

**Randeng Pegasus 523M Summary Chinese V1** (IDEA-CCNL)
A Chinese PEGASUS-large model specialized in text summarization, fine-tuned on multiple Chinese summarization datasets.
Tags: Text Generation · Transformers · Chinese
Downloads: 95 · Likes: 5

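For orientation, here is a minimal sketch of how a PEGASUS-style Chinese summarization checkpoint such as this one is typically driven through the transformers library. The model id used below and the availability of a standard AutoTokenizer are assumptions; the official model card may require a custom tokenizer instead.

```python
# Minimal sketch of abstractive summarization with a PEGASUS-style checkpoint.
# Assumptions: the model id below is correct and the checkpoint works with
# AutoTokenizer; the official card may ship a custom tokenizer instead.
from transformers import AutoTokenizer, PegasusForConditionalGeneration

model_id = "IDEA-CCNL/Randeng-Pegasus-523M-Summary-Chinese"  # assumed id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = PegasusForConditionalGeneration.from_pretrained(model_id)

text = "据报道，今日某地举办了大型科技展览，吸引了来自全国各地的参观者。"
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
summary_ids = model.generate(**inputs, max_length=64, num_beams=4)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```
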
**ERNIE 3.0 Mini Zh** (nghuyong)
ERNIE 3.0 is a large-scale knowledge-enhanced pre-trained model for Chinese language understanding and generation; the mini version is its lightweight variant.
Tags: Large Language Model · Transformers · Chinese
Downloads: 569 · Likes: 2

**ScholarBERT** (globuslabs, Apache-2.0)
A BERT-large variant with 340 million parameters, pretrained on a large-scale collection of scientific papers and specialized in scientific literature comprehension.
Tags: Large Language Model · Transformers · English
Downloads: 25 · Likes: 9

**ProcBERT** (fbaigt)
ProcBERT is a pre-trained language model optimized for procedural text, pre-trained on a large corpus of procedural documents (biomedical literature, chemical patents, and cooking recipes), and performs strongly on downstream tasks.
Tags: Large Language Model · Transformers · English
Downloads: 13 · Likes: 1

**IndoBERT Large P2** (indobenchmark, MIT)
IndoBERT is a state-of-the-art Indonesian language model based on the BERT architecture, trained with the Masked Language Modeling (MLM) and Next Sentence Prediction (NSP) objectives.
Tags: Large Language Model · Other
Downloads: 2,272 · Likes: 8

**ELECTRA Base GC4 64k 500000 Cased Generator** (stefan-it, MIT)
A German ELECTRA base generator model trained on the cleaned German Common Crawl corpus (GC4), roughly 844GB of text; the corpus, and therefore the model, may contain biases.
Tags: Large Language Model · Transformers · German
Downloads: 16 · Likes: 0

**Wav2Vec2 Base NL VoxPopuli** (facebook)
A Wav2Vec2 base model pretrained on the Dutch subset of the VoxPopuli corpus, intended as a starting point for Dutch speech recognition tasks.
Tags: Speech Recognition · Transformers · Other
Downloads: 31 · Likes: 0

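Because this checkpoint is pretrained only (it ships without a CTC head), the following is a hedged sketch of using it to extract speech representations before fine-tuning for transcription. The model id and the use of a default Wav2Vec2FeatureExtractor are assumptions.

```python
# Minimal sketch: extract frame-level speech representations from the
# pretrained Dutch wav2vec2 checkpoint. No ASR head is included; fine-tuning
# with a CTC head would be needed for actual transcription.
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

model_id = "facebook/wav2vec2-base-nl-voxpopuli"  # assumed id
feature_extractor = Wav2Vec2FeatureExtractor(sampling_rate=16000)  # default settings, assumed adequate
model = Wav2Vec2Model.from_pretrained(model_id)

# One second of silence at 16 kHz stands in for real Dutch audio.
waveform = torch.zeros(16000)
inputs = feature_extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    hidden_states = model(**inputs).last_hidden_state  # shape: (1, frames, 768)
print(hidden_states.shape)
```
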
**BERT Large AraBERTv2** (aubmindlab)
AraBERT is a pre-trained language model based on Google's BERT architecture, designed specifically for Arabic natural language understanding tasks.
Tags: Large Language Model · Arabic
Downloads: 334 · Likes: 11

**Chinese MobileBERT** (Ayou, Apache-2.0)
Pre-trained on a 250-million-word Chinese corpus using the MobileBERT architecture; training took 15 days and 1 million steps on a single A100 GPU.
Tags: Large Language Model · Transformers
Downloads: 25 · Likes: 5

**XLM-RoBERTa Large** (FacebookAI, MIT)
XLM-RoBERTa is a multilingual model pretrained on 2.5TB of filtered CommonCrawl data covering 100 languages, trained with a masked language modeling objective.
Tags: Large Language Model · Multilingual
Downloads: 5.3M · Likes: 431

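Since XLM-RoBERTa is a masked language model, it can be probed directly with the fill-mask pipeline. Below is a minimal sketch, assuming the checkpoint id published under the FacebookAI organization.

```python
# Minimal sketch: masked-token prediction with XLM-RoBERTa-large.
# The model id is assumed to be the checkpoint published under FacebookAI.
from transformers import pipeline

unmasker = pipeline("fill-mask", model="FacebookAI/xlm-roberta-large")
# XLM-RoBERTa uses <mask> as its mask token.
for prediction in unmasker("Paris is the <mask> of France."):
    print(f"{prediction['token_str']!r}: {prediction['score']:.3f}")
```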